Rates of convergence for stochastic approximation

Aditya Mahajan

McGill University

SIAM CT25 Tutorial

30 Jul, 2025

Discussion so far …

  • Asymptotic convergence of stochastic approximation \[ θ_{t+1} = θ_t + α_t \bigl[ f(θ_t) + ξ_{t+1} \bigr], \quad t \ge 1 \] Under appropriate technical assumptions \(θ_t \to θ^*\) almost surely or in mean-square.
  • But for practical algorithms, we care about rates of convergence

    • Asymptotic rate (i.e., when \(t\) is large)
    • Non-asymptotic rate (i.e., for all values of \(t\))

An example

Stochastic approximation with \(f(θ) = -0.5 θ\) and \(α_t = C/(t+1)\)
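This example can be simulated directly. A minimal sketch (the values of \(C\), the noise level, the horizon, and the initial point are illustrative assumptions, not part of the tutorial):

```python
import random

def run_sa(C=4.0, T=10000, theta0=5.0, sigma=1.0, seed=0):
    """Iterate theta_{t+1} = theta_t + alpha_t * (f(theta_t) + xi) with
    f(theta) = -0.5 * theta, alpha_t = C/(t+1), and Gaussian noise xi."""
    rng = random.Random(seed)
    theta = theta0
    for t in range(T):
        alpha = C / (t + 1)
        theta += alpha * (-0.5 * theta + rng.gauss(0.0, sigma))
    return theta

# theta* = 0; estimate the mean-square error over independent runs
mse = sum(run_sa(seed=s) ** 2 for s in range(100)) / 100
print(mse)
```

The empirical mean-square error decays with the horizon \(T\), which is the behavior the rest of the tutorial quantifies.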

Objective of the tutorial

  • Results on non-asymptotic rate of convergence appear intimidating
  • Are derived under different technical assumptions and at times require tedious algebra and book-keeping
  • Explain the key results
  • … and present the intuition behind the key technical ideas
  • Present only a few of the results, in their simplest setting
  • Do not explicitly keep track of constants for pedagogic clarity
  • Do not present a detailed attribution of each result but only a few key references

Modeling assumptions (1/3)

Stochastic approximation. Given a filtration \(\{\ALPHABET F_t\}_{t \ge 0}\), adapted sequences \(\{α_t\}_{t \ge 0}\) and \(\{\xi_t\}_{t \ge 1}\), consider the iteration:

\[ θ_{t+1} = θ_t + α_t \bigl[ f(θ_t) + ξ_{t+1} \bigr], \quad t \ge 1 \]

Assumptions on the function

F1. There is a unique \(θ^* \in \reals^d\) such that \(f(θ^*) = 0\).

F2. The function \(f\) is globally Lipschitz continuous with constant \(L\). \[ \NORM{f(θ_1) - f(θ_2)}_2 \le L \NORM{θ_1 - θ_2}_2, \quad \forall θ_1, θ_2 \in \reals^d. \]

Modeling assumptions (2/3)

\[ θ_{t+1} = θ_t + α_t \bigl[ f(θ_t) + ξ_{t+1} \bigr], \quad t \ge 1 \]

Assumptions on the noise

N1. Martingale difference noise. The noise is a martingale difference sequence, i.e., \[ \EXP_t[ ξ_{t+1} ] = 0, \hskip 0.5em a.s. \]

N2. Controlled growth of variance. There exists a \(σ^2\) such that \[ \EXP_t[ \NORM{ξ_{t+1}}^2_2 ] \le σ^2(1 + \NORM{θ_t - θ^*}_2^2), \hskip 0.5em a.s. \]

Modeling assumptions (3/3)

Assumption on the Lyapunov function

There exists a twice-differentiable Lyapunov function \(V \colon \reals^d \to \reals\) that satisfies

V1. Squared norm equivalent. There exist \(a,b > 0\) such that \[ a \NORM{θ - θ^*}_2^2 \le V(θ) \le b \NORM{θ - θ^*}_2^2, \quad \forall θ \in \reals^d. \]

V2. Smoothness. There exists \(M > 0\) such that \[ \NORM{\GRAD V(θ_1) - \GRAD V(θ_2) }_2 \le 2M \NORM{θ_1 - θ_2}_2, \quad \forall θ_1, θ_2 \in \reals^d. \]

V4. Exponential stability. There exists \(ρ > 0\) such that \[ \IP{\GRAD V(θ)}{f(θ)} < - ρ V(θ), \quad \forall θ \in \reals^d \setminus \{θ^*\}. \]

Implication of the assumptions

Implication 1 (F1), (F2), and (V1) imply \[ \NORM{f(θ)}_2^2 = \NORM{f(θ) - f(θ^*)}_2^2 \le L^2 \NORM{θ - θ^*}_2^2 \le \textcolor{red}{L^2} V(θ). \] and \[ \EXP_t[ \NORM{ξ_{t+1}}_2^2 ] \le σ^2(1 + \NORM{θ_t - θ^*}_2^2) \le \textcolor{red}{σ^2}(1 + V(θ_t)) \]

Implication 2 (V2) implies \[ V(θ + η) \le V(θ) + \IP{\GRAD V(θ)}{η} + M \NORM{η}^2_2. \]

Key Idea 1: Lyapunov drift plus diffusion

  • From implication 2, we get \[ V(θ_{t+1}) \le V(θ_t) + α_t \IP{\GRAD V(θ_t)}{ f(θ_t) + ξ_{t+1}} + M α_t^2 \NORM{ f(θ_t) + ξ_{t+1} }_2^2. \]
  • From (N1), (N2), and implication 1, we get \[ \EXP_t[V(θ_{t+1})] \le V(θ_t) + α_t \IP{\GRAD V(θ_t)}{f(θ_t)} + M α_t^2 \bigl[L V(θ_t) + σ^2(1 + V(θ_t)) \bigr] \]
  • Taking another expectation assuming \(\{α_t\}_{t \ge 0}\) is deterministic, we get \[ \EXP[ V(θ_{t+1}) ] \le (1 - ρ α_t + α_t^2 (\textcolor{red}{L^2} + \textcolor{red}{σ^2})) \EXP[ V(θ_t) ] + α_t^2 \textcolor{red}{σ^2} \]

Lyapunov drift plus diffusion

Key Idea 2: Discrete Gronwall inequality

Let \(\{a_t\}_{t \ge 0}\), \(\{b_t\}_{t \ge 0}\), \(\{u_t\}_{t \ge 0}\) be real-valued sequences with \(1 + a_t > 0\) that satisfy \[ u_{t+1} \le (1 + a_t)u_t + b_t. \] Define \(Φ_{t,T} = \prod_{s=t}^{T-1}(1 + a_s)\). Then,

\[ u_T \le Φ_{0,T} u_0 + \sum_{t=0}^{T-1} Φ_{t+1,T} b_t. \]
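The inequality is easy to verify numerically on arbitrary sequences satisfying the hypothesis. A small sketch (the random sequences below are illustrative, not from the tutorial):

```python
import random

def gronwall_rhs(a, b, u0):
    """Right-hand side of the discrete Gronwall bound:
    Phi_{0,T} * u0 + sum_t Phi_{t+1,T} * b_t, with Phi_{t,T} = prod_{s=t}^{T-1}(1 + a_s)."""
    T = len(a)
    def Phi(t):
        p = 1.0
        for s in range(t, T):
            p *= 1.0 + a[s]
        return p
    return Phi(0) * u0 + sum(Phi(t + 1) * b[t] for t in range(T))

rng = random.Random(1)
T = 50
a = [rng.uniform(-0.5, 0.5) for _ in range(T)]   # ensures 1 + a_t > 0
b = [rng.uniform(0.0, 1.0) for _ in range(T)]
u = [1.0]
for t in range(T):  # build any sequence with u_{t+1} <= (1 + a_t) u_t + b_t
    u.append((1.0 + a[t]) * u[t] + rng.uniform(0.0, 1.0) * b[t])
assert u[T] <= gronwall_rhs(a, b, u[0]) + 1e-9
```

The proof is a simple induction: divide both sides by \(Φ_{0,t+1}\) and telescope.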

Implication of Gronwall inequality

Let \(Φ_{t,T} = \prod_{s=t}^{T-1} (1 - ρ α_s + α_s^2(L^2 + σ^2))\). Then \[ \EXP[V(θ_T)] \le Φ_{0,T} \EXP[ V(θ_0) ] + σ^2 \sum_{t=0}^{T-1} Φ_{t+1,T} α_t^2 \]

  • The first term may be viewed as zero input response of the system
  • The second term may be viewed as zero state response of the system

General idea to characterize rates

  • If \(\{α_t\}_{t \ge 0}\) is a decreasing sequence, at some time \(T_0\), \(α_t\) becomes smaller than \(ρ/(2(L^2 + σ^2))\). Then, \[ 1 - ρ α_t + α_t^2 (L^2 + σ^2) < 1 - \tfrac 12 ρ α_t, \quad \forall t \ge T_0. \]

  • Therefore, for \(t \ge T_0\), \[ Φ_{t+1,T} \le \prod_{s=t+1}^{T-1}(1 - \tfrac 12 ρ α_s) \le \exp\left(-\frac ρ2 \sum_{s=t+1}^{T-1} α_s \right). \]

  • So, for the zero input response, we simply need the rate of growth of \(\sum α_s\)

  • The zero state response is trickier (more on that later).

Rate of growth of \(\sum α_s\)

  • Assume \(α_t = C/(t+1)^q\), for \(q > 0\).
  • By bounding discrete sum by integral, we can show that

\[ \sum_{t=0}^{T-1} \dfrac{1}{(t+1)^q} \le φ_q(T) \coloneqq \begin{cases} \dfrac{T^{1-q}}{1-q}, & q \in (0,1) \\ \log T, & q = 1 \\ < ∞, & q > 1 \end{cases} \]

  • Moreover, for \(T\) sufficiently large and \(t < T\), \[ \frac 12 \bigl[ φ_q(T) - φ_q(t) \bigr] \le \sum_{s=t}^{T-1} \dfrac{1}{(s+1)^q} \le φ_q(T) - φ_q(t). \]
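Both bounds can be spot-checked numerically. A sketch (the +1 slack in the upper bound absorbs the \(t = 0\) term that the integral comparison misses, and the lower bound is checked only for large \(T\)):

```python
import math

def phi(q, T):
    """phi_q(T) as defined above; for q > 1 any finite constant works,
    e.g. the integral bound 1 + 1/(q - 1)."""
    if 0 < q < 1:
        return T ** (1 - q) / (1 - q)
    if q == 1:
        return math.log(T)
    return 1 + 1 / (q - 1)

for q in (0.6, 0.8, 1.0):
    for T in (10, 1000, 100000):
        s = sum(1.0 / (t + 1) ** q for t in range(T))
        assert s <= phi(q, T) + 1.0        # upper bound (up to the t = 0 term)
        if T >= 1000:                      # lower bound, for T large enough
            assert s >= 0.5 * phi(q, T)
print("bounds verified")
```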

Case 1: \(q \in (0.5, 1)\)

  • From rate of growth of \(\sum α_s\) we have \[ Φ_{0,T} \le \exp\left( - \frac{ρC}{2(1-q)} T^{1-q} \right) \] which decays sub-exponentially.

  • Moreover, assume that \(T_0\) is large enough so that \(1/(t+1) \le \tfrac 12 ρ C/(t+1)^q\) for all \(t+1 \ge T_0\). Then, for any \(t+1 \ge T_0\), \[ Φ_{t+1, T} \le \prod_{s=t+1}^{T-1} \left( 1 - \frac{1}{s+1} \right) = \frac{t+1}{T} \]

Case 1: \(q \in (0.5, 1)\)

  • The zero input term decays as \[ Φ_{0,T} \EXP[ V(θ_0) ] \le \EXP[ V(θ_0) ]\exp\left( - \frac{ρC}{2(1-q)} T^{1-q} \right) \]

  • The tail of zero state term decays as \[ σ^2 \sum_{t=\textcolor{red}{T_0}}^{T-1} Φ_{t+1,T} α_t^2 \le \frac{σ^2 C^2}{T} \sum_{t=T_0}^{T-1} \frac{1}{(t+1)^{2q-1}} \le \frac{σ^2 C^2}{2(1-q)} \frac{1}{T^{2q-1}} \]

Case 1: \(q \in (0.5, 1)\)

  • Overall rate \[\begin{align*} \EXP[V(θ_{T})] &\le [ \EXP[V(θ_0)] + \text{const}(T_0) ] \exp\left( - \tfrac{ρC}{2(1-q)} T^{1-q} \right) \\ &\quad + \frac{C^2 σ^2}{2(1-q)} \frac{1}{T^{2q-1}} = \mathcal{O}\left(\frac 1{T^{2q-1}}\right) \end{align*}\]
  • The analysis above is simplified from Bach and Moulines (2011) and Hu and Fu (2025), which present the results in the setting of stochastic gradient descent.
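The deterministic recursion behind the bound can be iterated directly to check the claimed rate. A sketch with illustrative constants (the \(α_t^2\) term multiplying \(\EXP[V(θ_t)]\) is dropped, as in the simplified bound):

```python
def drift_bound(q, C=1.0, rho=1.0, sigma2=1.0, T=100000, delta0=10.0):
    """Iterate Delta_{t+1} = (1 - rho*alpha_t) Delta_t + sigma^2 * alpha_t^2
    with alpha_t = C/(t+1)^q, a simplified version of the bound above."""
    d = delta0
    for t in range(T):
        a = C / (t + 1) ** q
        d = max(0.0, 1.0 - rho * a) * d + sigma2 * a * a
    return d

q = 0.75
for T in (10000, 100000):
    d = drift_bound(q, T=T)
    # consistent with the O(1/T^{2q-1}) upper bound (here O(T^{-1/2}))
    assert d * T ** (2 * q - 1) < 5.0
print("rate check passed")
```

The recursion in fact decays even faster than the stated bound; the slide's rate is an upper bound obtained from the loose estimate \(Φ_{t+1,T} \le (t+1)/T\).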

Case 2: \(q = 1\)

  • From rate of growth of \(\sum α_s\), we have \[ Φ_{t+1,T} \le \exp\left(- \frac{ρC}{2} \log \left(\frac{T}{t+1}\right)\right) = \left(\frac{t+1}{T}\right)^{ρC/2}. \]

  • The zero input term decays as \[ Φ_{0,T} \EXP[ V(θ_0) ] \le \EXP[ V(θ_0) ] \left(\frac{1}{T}\right)^{ρC/2} \]

  • The tail of zero state term decays as \[ σ^2 \sum_{t=\textcolor{red}{T_0}}^{T-1} Φ_{t+1,T} α_t^2 \le \frac{σ^2 C^2}{T^{ρC/2}} \sum_{t=T_0}^{T-1} \frac{1}{(t+1)^{2 - \frac{ρC}2}} \le \frac{σ^2 C^2}{T^{ρC/2}} φ_{2 - ρC/2}(T) \]

Case 2: \(q = 1\)

  • Overall Rate depends on \(ρC\)

    \[ \EXP[ V(θ_T) ] \le \begin{dcases} \mathcal{O}\left( \frac 1T \right), & ρC > 2 \\[5pt] \mathcal{O}\left( \frac {\log(T)}T \right), & ρC = 2 \\[5pt] \mathcal{O}\left( \frac {1}{T^{ρC/2}} \right), & ρC < 2 \end{dcases} \]
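The dependence on \(ρC\) can be checked by iterating the deterministic recursion behind the bound with \(q = 1\) (illustrative constants; this is the upper-bound recursion, not the algorithm itself):

```python
def drift_bound_q1(C, rho=1.0, sigma2=1.0, T=100000, delta0=10.0):
    """Iterate Delta_{t+1} = (1 - rho*alpha_t) Delta_t + sigma^2 * alpha_t^2
    with alpha_t = C/(t+1)."""
    d = delta0
    for t in range(T):
        a = C / (t + 1)
        d = max(0.0, 1.0 - rho * a) * d + sigma2 * a * a
    return d

T = 100000
d_fast = drift_bound_q1(C=4.0, T=T)  # rho*C = 4 > 2: O(1/T)
d_slow = drift_bound_q1(C=1.0, T=T)  # rho*C = 1 < 2: O(1/T^{rho*C/2})
assert d_fast * T < 50.0
assert d_slow * T ** 0.5 < 50.0
assert d_fast < d_slow
print(d_fast, d_slow)
```

A small \(C\) (so that \(ρC < 2\)) visibly slows the decay, which is why step-size constants matter for \(α_t = C/(t+1)\).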

Back to the example

Stochastic approximation with \(f(θ) = -0.5 θ\) and \(α_t = C/(t+1)\)

  • The ODE \(\dot θ = f(θ)\) is globally asymptotically stable with Lyapunov function \[ V(θ) = \ABS{θ}^2. \]

  • The system is exponentially stable, since \[ \IP{ \GRAD V(θ)}{f(θ)} = -\ABS{θ}^2 = - V(θ). \]

Back to the example

Stochastic approximation with \(f(θ) = -0.5 θ\) and \(α_t = C/(t+1)\)

Exponential stability is important

Stochastic approximation with \(f(θ) = -θ^3/(1 + θ^2)\) and \(α_t = C/(t+1)\)

  • The ODE \(\dot θ = f(θ)\) is globally asymptotically stable with Lyapunov function \[ V(θ) = \ABS{θ}^2. \]

  • The system is not exponentially stable, since \[ \IP{ \GRAD V(θ)}{f(θ)} = -\frac{2 θ^4}{1 + θ^2} \textcolor{red}{\neq} -ρ V(θ). \]

Exponential stability is important

Stochastic approximation with \(f(θ) = -θ^3/(1 + θ^2)\) and \(α_t = C/(t+1)\)
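A side-by-side simulation makes the gap visible (the noise level, constants, and horizon below are illustrative assumptions):

```python
import random

def simulate(f, C=2.0, T=20000, theta0=2.0, sigma=0.5, seed=0):
    """theta_{t+1} = theta_t + alpha_t * (f(theta_t) + xi), alpha_t = C/(t+1)."""
    rng = random.Random(seed)
    theta = theta0
    for t in range(T):
        theta += (C / (t + 1)) * (f(theta) + rng.gauss(0.0, sigma))
    return theta

f_lin = lambda x: -0.5 * x                  # exponentially stable
f_cub = lambda x: -x ** 3 / (1 + x ** 2)    # not exponentially stable near 0

mse_lin = sum(simulate(f_lin, seed=s) ** 2 for s in range(100)) / 100
mse_cub = sum(simulate(f_cub, seed=s) ** 2 for s in range(100)) / 100
assert mse_cub > 10 * mse_lin  # markedly slower without exponential stability
print(mse_lin, mse_cub)
```

Near the equilibrium the cubic drift \(-θ^3\) is much weaker than any linear drift, so the iterates linger; this is exactly the degeneracy that assumption (V4) rules out.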

Rate of convergence without exponential stability

  • Suppose the Lyapunov function satisfies

V4’. There exist \(ρ > 0\) and \(c > 0\) such that for all \(θ \in \reals^d\), \[ \IP{ \GRAD V(θ) }{ f(θ) } \le c - ρV(θ) \]

  • Use random stopping (Nemirovski et al. 2009). Run stochastic approximation for a random time \(τ \in \{T_0, \dots, T_1\}\), where \[ \PR(τ = t) = \frac{α_t}{α_{T_0} + \cdots + α_{T_1}}, \quad t \in \{T_0, \dots, T_1\}. \]

Rate of convergence without exponential stability

  • Almost the same analysis as before shows that if \(α_t < ρ/(2(L^2+σ^2))\) for all \(t \ge T_0\), then (using \(Δ_t = \EXP[ V(θ_t) ]\)), \[ \EXP[Δ_{τ}] \le \frac{(Δ_{T_0}-Δ_{T_1+1}) + σ^2 \sum_{t=T_0}^{T_1} α_t^2}{\sum_{t={T_0}}^{T_1} α_t} + c \]
  • Taking \(α_t = C/\sqrt{t+1}\) gives \[ \EXP[Δ_{τ}] \le c + \mathcal{O}\left( \frac{\log T}{\sqrt{T}} \right). \]

  • Poorer rate, but does not require exponential stability assumption.
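The random-stopping scheme is a few lines of code. A minimal sketch (the constants, the noise level, and the window \([T_0, T_1]\) are illustrative assumptions):

```python
import random

def sa_random_stopping(f, C=0.5, T0=100, T1=10000, theta0=2.0, seed=0):
    """Run SA with alpha_t = C/sqrt(t+1) up to a random time tau drawn
    from {T0, ..., T1} with P(tau = t) proportional to alpha_t."""
    rng = random.Random(seed)
    alphas = [C / (t + 1) ** 0.5 for t in range(T1 + 1)]
    tau = rng.choices(range(T0, T1 + 1), weights=alphas[T0:])[0]
    theta = theta0
    for t in range(tau):
        theta += alphas[t] * (f(theta) + rng.gauss(0.0, 0.5))
    return theta

f = lambda x: -x ** 3 / (1 + x ** 2)  # no exponential stability
msq = sum(sa_random_stopping(f, seed=s) ** 2 for s in range(100)) / 100
print(msq)
```

Averaging over the random stopping time plays the role of the weighted average \(\sum_t α_t Δ_t / \sum_t α_t\) in the bound above.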

Some remarks about RL setting (1/2)

  • Many RL algorithms may be viewed as \[ θ_{t+1} = θ_t + α_t[ H(θ_t) - θ_t + ξ_{t+1} ] \] where \(H \colon \reals^d \to \reals^d\) is a \(γ\)-contraction, \(γ \in (0,1)\).
  • If \(H\) is a \(p\)-norm contraction, \(p \in (1, ∞]\) then the ODE \(\dot θ = H(θ) - θ\) is globally asymptotically stable (Borkar and Soumyanath 1997) and \[ V(θ) = \NORM{θ - θ^*}_p \] is a Lyapunov function for the ODE which satisfies \[ \IP{ \GRAD V(θ) }{ H(θ) - θ } \le -(1-γ)V(θ). \]

Some remarks about RL setting (2/2)

  • In some settings, \(H\) is a weighted \(\ell_2\)-norm contraction.

    • The Lyapunov function satisfies (V1)–(V4), and the above bounds hold.
  • In general, \(H\) is an \(\ell_{\infty}\)-norm contraction.

    • The Lyapunov function does not satisfy (V2) (smoothness).

    • Can consider Moreau envelope \[ M_{λ}(θ) = \min_{φ \in \reals^d} \left\{ \NORM{φ}_{∞} + \frac{1}{2 λ} \NORM{θ - φ}_2^2 \right\} \] which is close to \(\NORM{⋅}_{∞}\) and is \(1/λ\)-smooth (Chen et al. 2020).
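The envelope can be evaluated in closed form: by Moreau decomposition, the minimizer is \(θ\) minus the Euclidean projection of \(θ\) onto the \(\ell_1\) ball of radius \(λ\). A sketch (using the sort-based \(\ell_1\)-projection of Duchi et al., 2008; the test point below is illustrative):

```python
def proj_l1_ball(x, radius):
    """Euclidean projection of x onto the l1 ball {z : ||z||_1 <= radius}."""
    if sum(abs(v) for v in x) <= radius:
        return list(x)
    u = sorted((abs(v) for v in x), reverse=True)
    css, tau = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        css += uj
        t = (css - radius) / j
        if uj > t:
            tau = t  # last j with u_j > (sum of top-j entries - radius)/j
    return [(1.0 if v >= 0 else -1.0) * max(abs(v) - tau, 0.0) for v in x]

def moreau_env_linf(theta, lam):
    """M_lam(theta) = min_phi ||phi||_inf + ||theta - phi||_2^2 / (2*lam).
    The minimizer is phi = theta - p, where p projects theta onto lam * B_1."""
    p = proj_l1_ball(theta, lam)
    phi = [a - b for a, b in zip(theta, p)]
    return max(abs(v) for v in phi) + sum(v * v for v in p) / (2.0 * lam)

# sanity check: ||theta||_inf - lam/2 <= M_lam(theta) <= ||theta||_inf
print(moreau_env_linf([1.5, -0.3, 0.7], 0.5))  # 1.25
```

Since \(\NORM{⋅}_{∞}\) is 1-Lipschitz in \(\ell_2\), the envelope satisfies \(\NORM{θ}_{∞} - λ/2 \le M_λ(θ) \le \NORM{θ}_{∞}\), which is the sense in which it is "close" to the original norm.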

Conclusion

  • Simple derivation of non-asymptotic rate of convergence for stochastic approximation

  • Key ideas:

    • Lyapunov drift plus diffusion
    • Discrete Gronwall inequality
    • Some algebra for \(\sum t^{-q}\).
  • Quite a rich area. Lot of interest in relaxing the regularity assumptions on \(f\) and \(V\), depending on specific application.

For a broader survey of stochastic approximation beyond the gradient setting, see Dieuleveut et al. (2023).

References

Bach, F. and Moulines, E. 2011. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in neural information processing systems, Curran Associates, Inc. Available at: https://papers.nips.cc/paper/2011/hash/40008b9a5380fcacce3976bf7c08af5b-Abstract.html.
Borkar, V.S. and Soumyanath, K. 1997. An analog scheme for fixed point computation. I. Theory. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 44, 4, 351–355. DOI: 10.1109/81.563625.
Chen, Z., Maguluri, S.T., Shakkottai, S., and Shanmugam, K. 2020. Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. Advances in neural information processing systems, Curran Associates, Inc., 8223–8234. Available at: https://proceedings.neurips.cc/paper/2020/hash/5d44ee6f2c3f71b73125876103c8f6c4-Abstract.html.
Dieuleveut, A., Fort, G., Moulines, E., and Wai, H.-T. 2023. Stochastic approximation beyond gradient for signal processing and machine learning. IEEE Transactions on Signal Processing 71, 3117–3148. DOI: 10.1109/tsp.2023.3301121.
Hu, J. and Fu, M.C. 2025. Technical note—on the convergence rate of stochastic approximation for gradient-based stochastic optimization. Operations Research 73, 2, 1143–1150. DOI: 10.1287/opre.2023.0055.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. 2009. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19, 4, 1574–1609. DOI: 10.1137/070704277.